22 research outputs found

    These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure

    K-mer abundance analysis is widely used for many purposes in nucleotide sequence analysis, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. We present the khmer software package for fast and memory efficient online counting of k-mers in sequencing data sets. Unlike previous methods based on data structures such as hash tables, suffix arrays, and trie structures, khmer relies entirely on a simple probabilistic data structure, a Count-Min Sketch. The Count-Min Sketch permits online updating and retrieval of k-mer counts in memory which is necessary to support online k-mer analysis algorithms. On sparse data sets this data structure is considerably more memory efficient than any exact data structure. In exchange, the use of a Count-Min Sketch introduces a systematic overcount for k-mers; moreover, only the counts, and not the k-mers, are stored. Here we analyze the speed, the memory usage, and the miscount rate of khmer for generating k-mer frequency distributions and retrieving k-mer counts for individual k-mers. We also compare the performance of khmer to several other k-mer counting packages, including Tallymer, Jellyfish, BFCounter, DSK, KMC, Turtle and KAnalyze. Finally, we examine the effectiveness of profiling sequencing error, k-mer abundance trimming, and digital normalization of reads in the context of high khmer false positive rates. khmer is implemented in C++ wrapped in a Python interface, offers a tested and robust API, and is freely available under the BSD license at github.com/ged-lab/khmer
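The abstract above describes counting k-mers with a Count-Min Sketch, which stores only counters and never the k-mers themselves. The following is a minimal illustrative sketch of that data structure in pure Python, not khmer's actual implementation or API. The class name `CountMinSketch`, the helper `count_kmers`, the default `width` and `depth`, and the use of salted BLAKE2 hashes are all assumptions made for this example.

```python
import hashlib


class CountMinSketch:
    """Toy Count-Min Sketch: `depth` tables of `width` counters, no keys stored."""

    def __init__(self, width=1000, depth=4):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, kmer):
        # One slot per table, each derived from a differently salted hash.
        for i in range(self.depth):
            digest = hashlib.blake2b(kmer.encode(), salt=bytes([i])).digest()
            yield i, int.from_bytes(digest[:8], "big") % self.width

    def add(self, kmer):
        # Online update: increment one counter in every table.
        for i, slot in self._slots(kmer):
            self.tables[i][slot] += 1

    def count(self, kmer):
        # Collisions can only inflate a counter, so the minimum across
        # tables is the tightest available estimate (never an undercount).
        return min(self.tables[i][slot] for i, slot in self._slots(kmer))


def count_kmers(reads, k, width=1000, depth=4):
    """Feed every k-mer of every read into a fresh sketch."""
    sketch = CountMinSketch(width, depth)
    for read in reads:
        for i in range(len(read) - k + 1):
            sketch.add(read[i:i + k])
    return sketch


sketch = count_kmers(["ACGTACGT"], k=4)
print(sketch.count("ACGT"))  # at least 2: the true count, possibly inflated
```

Because counters are shared between colliding k-mers, a reported count is an upper bound on the true count, which matches the systematic overcount mentioned in the abstract. Memory use is fixed by `width * depth` regardless of how many distinct k-mers are seen.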

    Assembling large, complex environmental metagenomes

    The large volumes of sequencing data required to sample complex environments deeply pose new challenges to sequence analysis approaches. De novo metagenomic assembly effectively reduces the total amount of data to be analyzed but requires significant computational resources. We apply two pre-assembly filtering approaches, digital normalization and partitioning, to make large metagenome assemblies more computationally tractable. Using a human gut mock community dataset, we demonstrate that these methods result in assemblies nearly identical to assemblies from unprocessed data. We then assemble two large soil metagenomes from matched Iowa corn and native prairie soils. The predicted functional content and phylogenetic origin of the assembled contigs indicate significant taxonomic differences despite similar function. The assembly strategies presented are generic and can be extended to any metagenome; full source code is freely available under a BSD license. Comment: Includes supporting information
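The abstract names digital normalization as a pre-assembly filter without describing it. As the technique is commonly described, a read is kept only if the median abundance of its k-mers, among reads kept so far, is still below a coverage cutoff. The sketch below illustrates that idea only. It uses an exact `Counter` rather than a memory-efficient probabilistic counter, and the function name `digital_normalization` and the default `k` and `cutoff` values are assumptions for this example, not the authors' code or settings.

```python
from collections import Counter
from statistics import median


def digital_normalization(reads, k=20, cutoff=20):
    """Keep a read only while its median k-mer abundance is below `cutoff`."""
    counts = Counter()  # exact counts; a real tool would use far less memory
    kept = []
    for read in reads:
        read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not read_kmers:
            continue  # read is shorter than k, nothing to estimate from
        # Median k-mer count approximates the coverage already retained
        # for the region this read comes from.
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads add to the counts
    return kept


reads = ["ACGTACGTAC"] * 5 + ["TTTTGGGGCC"]
print(digital_normalization(reads, k=4, cutoff=2))
```

In this toy input the five identical reads collapse to a single retained copy while the distinct read is kept, which is the data-reduction effect that makes downstream assembly cheaper.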

    Microbial linkages to soil biogeochemical processes in a poorly drained agricultural ecosystem

    Soil microorganisms mediate biogeochemical processes, but how microbial community composition influences these processes remains contested. We combined monthly sequencing of soil 16S rRNA genes and intensive measurements of nitrogen (N), carbon (C), and iron (Fe) cycling along a topographic gradient in a poorly drained intensive agricultural ecosystem (corn–soybean rotation) in the midwestern United States. Observed microbial composition changed little over time within and among years despite large differences in weather and crop type. Yet, microbial composition varied greatly with topographic location and correlated strongly with moisture, soil organic carbon (SOC), and especially pH. Microbial families, genera, and/or amplicon sequence variants often correlated significantly with measured biogeochemical processes or pools, yet different taxa within the same phylogenetic groups often responded in opposite ways, indicating a lack of ecological coherence among close relatives. Dominant phyla were generally similar across the topographic gradient but specific members showed consistent tradeoffs among locations. Ammonia oxidizing archaea and bacteria sequences varied oppositely with pH across the gradient, but their combined relative abundances remained similar, as did potential nitrification rates. Nitrospira sequences correlated positively with nitrous oxide (N2O) fluxes, suggesting a direct or indirect contribution of nitrification (or possibly comammox) to N2O production. We also found significant linkages between taxonomic groups and redox-sensitive Fe pools, indicating a role for redox variation in structuring microbial communities. Several globally dominant bacteria identified previously correlated significantly with measured biogeochemical variables, providing insights into their possible functional roles. 
Overall, microbial composition provided a coarse measure of several key biogeochemical functions and implicated taxa that possibly mediate these processes in a widespread agroecosystem of North America. This is a manuscript of an article published as Yu, Wenjuan, Nathaniel C. Lawrence, Thanwalee Sooksa-nguan, Schuyler D. Smith, Carlos Tenesaca, Adina Chuang Howe, and Steven J. Hall. "Microbial linkages to soil biogeochemical processes in a poorly drained agricultural ecosystem." Soil Biology and Biochemistry (2021): 108228. doi:10.1016/j.soilbio.2021.108228. Posted with permission.

    Tackling soil diversity with the assembly of large, complex metagenomes

    The large volumes of sequencing data required to sample deeply the microbial communities of complex environments pose new challenges to sequence analysis. De novo metagenomic assembly effectively reduces the total amount of data to be analyzed but requires substantial computational resources. We combine two preassembly filtering approaches—digital normalization and partitioning—to generate previously intractable large metagenome assemblies. Using a human-gut mock community dataset, we demonstrate that these methods result in assemblies nearly identical to assemblies from unprocessed data. We then assemble two large soil metagenomes totaling 398 billion bp (equivalent to 88,000 Escherichia coli genomes) from matched Iowa corn and native prairie soils. The resulting assembled contigs could be used to identify molecular interactions and reaction networks of known metabolic pathways using the Kyoto Encyclopedia of Genes and Genomes Orthology database. Nonetheless, more than 60% of predicted proteins in assemblies could not be annotated against known databases. Many of these unknown proteins were abundant in both corn and prairie soils, highlighting the benefits of assembly for the discovery and characterization of novelty in soil biodiversity. Moreover, 80% of the sequencing data could not be assembled because of low coverage, suggesting that considerably more sequencing data are needed to characterize the functional content of soil

    Iterative low-memory k-mer trimming.

    The results of trimming reads at unique (erroneous) k-mers from a 5 m read E. coli data set (1.4 GB) in under 30 MB of RAM. After each iteration, we measured the total number of distinct k-mers in the data set, the total number of unique (and likely erroneous) k-mers remaining, and the number of unique k-mers present at the 3' end of reads.
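The caption above refers to repeatedly trimming reads at unique k-mers and tracking distinct and unique k-mer totals after each iteration. A minimal sketch of such a loop follows, under stated assumptions: it counts k-mers exactly with a `Counter` (so it does not reproduce the low-memory behaviour in the caption), it truncates each read just before the last base of its first unique k-mer, and the names `trim_once` and `iterative_trim` and the `max_rounds` limit are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def trim_once(reads, k):
    """Truncate each read at its first k-mer that occurs exactly once."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    trimmed = []
    for read in reads:
        cut = len(read)
        for i, km in enumerate(kmers(read, k)):
            if counts[km] == 1:
                cut = i + k - 1  # keeps only the k-mers before position i
                break
        if cut >= k:  # drop reads too short to contain a k-mer
            trimmed.append(read[:cut])
    return trimmed, counts


def iterative_trim(reads, k, max_rounds=10):
    """Repeat trimming until nothing changes; record per-round statistics."""
    stats = []  # (distinct k-mers, unique k-mers) measured before each trim
    for _ in range(max_rounds):
        trimmed, counts = trim_once(reads, k)
        unique = sum(1 for c in counts.values() if c == 1)
        stats.append((len(counts), unique))
        if trimmed == reads:
            break
        reads = trimmed
    return reads, stats


reads, stats = iterative_trim(["ACGTACGT", "ACGTACGT", "ACGTACGA"], k=4)
print(reads)  # third read loses its final, erroneous base
print(stats)
```

In this exact-count sketch, one reason to iterate is that trimming a read can remove a copy of a k-mer that previously occurred twice, leaving a new unique k-mer for the next round to remove.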